DAL tutorial - Week 5
Data transformation
1 Data transformation
Data transformation is a fundamental aspect of data analysis.
After the data you need to use is imported into R, you will have to filter rows, create new columns, or join data frames, among many other transformation operations.
In this tutorial we will learn how to filter() the data and mutate() or create new columns. In Week 6 (after Flexible Learning week) you will learn how to obtain summary measures and how to count occurrences using the summarise(), group_by() and count() functions.
2 Filter
Filtering data based on specific criteria couldn’t be easier with filter(), from the dplyr package (one of the tidyverse core packages).
Let’s work with the coretta2022/glot_status data frame. It’s an .rds file, so you need to use the readRDS() function. Go ahead and read the data into glot_status.
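As a minimal sketch of how .rds files work, the code below saves a made-up toy data frame to a temporary file and reads it back with readRDS(). The toy data and the temporary file are assumptions for illustration only. To read the real data, give readRDS() the path to the glot_status file on your own computer.

```r
# Toy data frame, invented for illustration only (not the real glot_status data).
toy <- data.frame(
  Name = c("Language A", "Language B"),
  status = c("extinct", "shifting")
)

# Write the toy data to a temporary .rds file...
tmp <- tempfile(fileext = ".rds")
saveRDS(toy, tmp)

# ...and read it back into a new object with readRDS().
toy_again <- readRDS(tmp)
toy_again
```

Note that readRDS() only returns the data: you need to assign the output to a name with <- to keep it.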
The glot_status data frame contains the endangerment status for 7,845 languages from Glottolog. There are thousands of languages in the world, but most of them are losing speakers, and some are already no longer spoken. The endangerment status of a language in the data is on a scale from not endangered (languages with large populations of speakers) through threatened, shifting and nearly extinct, to extinct (languages that have no living speakers left).
Before we can move on to filtering data, we first need to learn about logical operators.
2.1 Logical operators
There are four main logical operators:
x == y: x equals y.
x != y: x is not equal to y.
x > y: x is greater than y.
x < y: x is smaller than y.
Logical operators return TRUE or FALSE depending on whether the statement they convey is true or false. Remember, TRUE and FALSE are logical values.
Try these out in the Console:
# This will return FALSE
1 == 2
[1] FALSE
# FALSE
"apples" == "oranges"
[1] FALSE
# TRUE
10 > 5
[1] TRUE
# FALSE
10 > 15
[1] FALSE
# TRUE
3 < 4
[1] TRUE
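These operators also work on whole vectors, comparing every element in turn. A minimal sketch, using made-up vectors that are not part of the tutorial data:

```r
# A made-up vector of statuses, for illustration only.
status <- c("extinct", "shifting", "extinct", "threatened")

# The comparison is applied to every element in turn,
# so we get one TRUE or FALSE per element.
is_extinct <- status == "extinct"
is_extinct

# The same works with numbers.
is_big <- c(3, 12, 7) > 5
is_big
```

The first comparison returns TRUE FALSE TRUE FALSE and the second FALSE TRUE TRUE. A column of a data frame is a vector too, so a comparison on a column gives one TRUE or FALSE per row.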
Now let’s see how these work with filter()!
2.2 The filter() function
Filtering in R with the tidyverse is straightforward. You can use the filter() function.
filter() takes one or more statements with logical operators.
Let’s try this out. The following code filters the status column so that only the extinct status is included in the new data frame extinct.
You’ll notice we are using the pipe |> to transfer the data into the filter() function; the output of the filter() function is assigned with <- to extinct. The flow might seem a bit counter-intuitive, but you will soon get used to thinking like this when writing R code!
extinct <- glot_status |>
  filter(status == "extinct")
extinct
Neat! What if we want to include all statuses except extinct? Easy, we use the non-equal operator !=.
not_extinct <- glot_status |>
  filter(status != "extinct")
not_extinct
And if we want only non-extinct languages from South America? We can include multiple statements separated by a comma!
south_america <- glot_status |>
  filter(status != "extinct", Macroarea == "South America")
south_america
Combining statements like this will give you only those rows where both conditions apply. You can add as many statements as you need.
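To see that a row is kept only when both statements are TRUE, here is a minimal sketch on a made-up toy data frame (the rows are invented and are not taken from glot_status). It also shows the same filter written with the "and" operator &, which this tutorial does not otherwise cover.

```r
library(dplyr)

# Toy data, invented for illustration (not the real glot_status data).
toy <- tibble(
  Name = c("A", "B", "C", "D"),
  status = c("extinct", "shifting", "threatened", "extinct"),
  Macroarea = c("South America", "South America", "Africa", "Africa")
)

# Two statements separated by a comma: both must be TRUE for a row to be kept.
with_comma <- toy |>
  filter(status != "extinct", Macroarea == "South America")

# The same filter written with the "and" operator &.
with_and <- toy |>
  filter(status != "extinct" & Macroarea == "South America")

with_comma
```

Only language B is kept: A is from South America but extinct, C is not extinct but from Africa, and D fails both conditions.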
Now try to filter the data so that you include only not_endangered languages from all macro-areas except Eurasia. This time, don’t save the output to a new data frame. What happens? Where is the output shown?
glot_status |>
  filter(...)
This is all great, but what if we want to include more than one status or macro-area?
To do that we need another operator: %in%.
2.3 The %in% operator
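The %in% operator returns TRUE if the value on its left is found in the vector on its right, and FALSE otherwise. Try these in the Console: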
# TRUE
5 %in% c(1, 2, 5, 7)
[1] TRUE
# FALSE
"apples" %in% c("oranges", "bananas")
[1] FALSE
But %in% is even more powerful because the value on the left does not have to be a single value, but it can also be a vector! We say %in% is vectorised because it can work with vectors (most functions and operators in R are vectorised).
# TRUE, TRUE
c(1, 5) %in% c(4, 1, 7, 5, 8)
[1] TRUE TRUE
stocked <- c("durian", "bananas", "grapes")
needed <- c("durian", "apples")
# TRUE, FALSE
needed %in% stocked
[1] TRUE FALSE
Try to understand what is going on in the code above before moving on.
2.4 Now filter the data
Now we can filter glot_status to include only the macro-areas of the Global South and only languages that are either “threatened”, “shifting”, “moribund” or “nearly_extinct”. I have started the code for you; you just need to write the line for filtering status.
global_south <- glot_status |>
  filter(
    Macroarea %in% c("Africa", "Australia", "Papunesia", "South America"),
    ...
  )
This should not look too alien! The first statement, Macroarea %in% c("Africa", "Australia", "Papunesia", "South America"), looks at the Macroarea column and, for each row, it returns TRUE if the current row value is in c("Africa", "Australia", "Papunesia", "South America"), and FALSE if not.
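The row-by-row behaviour of %in% inside filter() can be sketched on a small made-up data frame. The rows below are invented for illustration, and the $ used to pull out a single column is base R syntax that this tutorial does not otherwise use.

```r
library(dplyr)

# Toy data, invented for illustration (not the real glot_status data).
toy <- tibble(
  Name = c("A", "B", "C", "D"),
  Macroarea = c("Africa", "Eurasia", "Papunesia", "North America")
)

# On its own, the statement gives one TRUE or FALSE per row.
in_area <- toy$Macroarea %in% c("Africa", "Papunesia")
in_area

# filter() keeps only the rows where the statement is TRUE.
kept <- toy |>
  filter(Macroarea %in% c("Africa", "Papunesia"))
kept
```

Here in_area is TRUE FALSE TRUE FALSE, so kept contains only languages A and C.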
3 Bar charts
We will first create a plot with counts of the number of languages in global_south by their endangerment status and then a plot where we also split the counts by macro-area.
3.1 Number of languages of the Global South by status
To create a bar chart, you can use the geom_bar() geometry.
Go ahead and complete the following code to create a bar chart.
global_south |>
  ggplot(aes(x = status)) +
  ...
Note how we’re using |> to pipe the global_south data frame into the ggplot() function. This works because ggplot()’s first argument is the data, and piping is a different way of providing the first argument to a function.
As mentioned above, the counting for the y-axis is done automatically. R looks in the status column and counts how many times each value in the column occurs in the data frame.
If you did things correctly, you should get the following plot.

The x-axis is now status and the y-axis corresponds to the number of languages by status (count). As mentioned above, count is calculated under the hood for you (you will learn how to count levels with count() later in the course).
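To get a feel for the numbers behind the bars, here is a minimal sketch that counts a made-up status column with count(). The toy data is invented for illustration. geom_bar() does its counting internally, so count() is used here only to display equivalent numbers, not to show what the plotting code literally runs.

```r
library(dplyr)

# Toy data, invented for illustration (not the real global_south data).
toy <- tibble(
  status = c("threatened", "shifting", "threatened", "moribund", "threatened")
)

# One row per status value, with the number of occurrences in column n.
counts <- toy |>
  count(status)
counts
```

In this toy example "threatened" occurs three times, so its bar would be three units tall, while "moribund" and "shifting" would each have a bar of height one.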
You could write a description of the plot that goes like this:
The number of languages in the Global South by endangerment status is shown as a bar chart in Figure 1. Among the languages that are endangered, the majority are threatened or shifting.
What if we want to show the number of languages by endangerment status within each of the macro-areas that make up the Global South? Easy! You can make a stacked bar chart, but we will get to that after we first learn about mutate().
4 Mutate
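As a minimal sketch, mutate() creates a new column from existing ones. The toy data frame and the new column name is_extinct below are made up for illustration and are not part of the tutorial data.

```r
library(dplyr)

# Toy data, invented for illustration (not the real glot_status data).
toy <- tibble(
  Name = c("A", "B", "C"),
  status = c("extinct", "shifting", "not_endangered")
)

# Add a new logical column that is TRUE when the language is extinct.
toy_new <- toy |>
  mutate(is_extinct = status == "extinct")
toy_new
```

The original columns are kept and the new column is added at the end. As with filter(), the output needs to be assigned with <- if you want to keep it.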
5 Stacked bar charts
A special type of bar chart is the so-called stacked bar chart.
To create a stacked bar chart, you just need to add a new aesthetic mapping to aes(): fill. The fill aesthetic lets you fill bars or areas with different colours depending on the values of a specified column.
Let’s make a plot on language endangerment by macro-area.
Complete the following code by specifying that fill should be based on status.
global_south |>
  ggplot(aes(x = Macroarea, ...)) +
  geom_bar()
You should get the following.

A write-up example:
Figure 3 shows the number of languages by geographic macroarea, subdivided by endangerment status. Africa, Eurasia and Papunesia have substantially more languages than the other areas.
In the plot above it is difficult to assess whether different macroareas have different proportions of endangerment. This is because the overall number of languages per area differs between areas.
A solution to this is to plot proportions instead of raw counts.
You could calculate the proportions yourself, but there is a quicker way: using the position argument in geom_bar().
You can plot proportions instead of counts by setting position = "fill" inside geom_bar(), like so:
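A minimal sketch on a made-up toy data frame (the rows are invented for illustration; with the tutorial data you would pipe global_south into ggplot() instead):

```r
library(ggplot2)

# Toy data, invented for illustration (not the real global_south data).
toy <- data.frame(
  Macroarea = c("Africa", "Africa", "Africa", "Australia", "Australia"),
  status = c("threatened", "shifting", "threatened", "extinct", "shifting")
)

# position = "fill" stretches every bar to the same height (1),
# so the coloured segments show proportions within each macro-area.
p <- toy |>
  ggplot(aes(x = Macroarea, fill = status)) +
  geom_bar(position = "fill")
p
```

Without position = "fill", geom_bar() stacks the raw counts, so the Africa bar (three languages) would be taller than the Australia bar (two languages).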

The plot now shows proportions of languages by endangerment status for each area separately. (Note that the y-axis label is still “count”, although the values are in fact proportions; you will learn how to change labels next week.)
With this plot it is easier to see that different areas have different proportions of endangerment. In writing:
Figure 4 shows proportions of languages by endangerment status for each macroarea. Australia, South America and North America have a substantially higher proportion of extinct languages than the other areas. These areas also have a higher proportion of nearly extinct languages. On the other hand, Africa has the greatest proportion of non-endangered languages, followed by Papunesia and Eurasia, while North and South America are among the areas with the lowest proportions, together with Australia, which has the lowest of all.